STAT 301 Final Group Report¶
suppressWarnings(suppressMessages({
library(AER)
library(boot)
library(broom)
library(caTools)
library(caret)
library(cowplot)
library(dplyr)
library(ggplot2)
library(GGally)
library(glmnet)
library(grid)
library(gridExtra)
library(infer)
library(modelr)
library(pROC)
library(repr)
library(doParallel)
library(reshape2)
library(tidyverse)
}))
cat("All necessary packages have been loaded successfully.")
Turning Over Turnover: A Comparative Exploration of Standard, Ridge, and Lasso Logistic Regressions for Employee Attrition Prediction¶
Group Members:
- Vivaan Jhaveri (39723044)
- Ethan Rajkumar (55024616)
- Michael Wang (32981300)
- Ruhani Kaur ( )
Introduction¶
Employee turnover is a critical challenge for organizations, with far-reaching consequences for productivity, morale, and financial performance. High turnover rates not only increase recruitment and training costs but also disrupt team dynamics and result in the loss of institutional knowledge. Retaining skilled employees is also essential for maintaining a competitive edge. Understanding the factors that drive employee turnover is key to developing effective retention strategies, making this a vital area of study for both practitioners and researchers.
Extensive prior research has identified key predictors of employee turnover, including job satisfaction, organizational commitment, and external job opportunities. Alkahtani (2015) highlights seven critical factors influencing turnover, such as perceived organizational support, supervisor support, and organizational justice. Kanchana and Jayathilaka (2023) demonstrated the significant impact of gender, age, and managerial interaction on turnover, emphasizing the importance of fostering employee engagement. Similarly, Alkaabi et al. (2024) underscored the roles of leadership efficacy, corporate culture, and career advancement opportunities, advocating for strategies such as leadership development programs and flexible work schedules to mitigate turnover risks.
Building on this foundation, this report will use the Employee dataset, a comprehensive resource containing anonymized data on 4,653 employees. The dataset, sourced from Kaggle, provides information on employee demographics, job characteristics, and work status within the organization.
The nine key variables in this dataset include:
- Education: Categorical variable representing the highest level of education attained by the employee ("Bachelors", "Masters", "PhD").
- JoiningYear: Numerical variable representing the year the employee joined the company.
- City: Categorical variable representing the city where the employee is located ("New Delhi", "Bangalore", "Pune").
- PaymentTier: Categorical variable representing the salary tier (1, 2, 3).
- Age: Numerical variable representing the age of the employee.
- Gender: Categorical variable representing the gender of the employee ("Male", "Female").
- EverBenched: Categorical (binary) variable representing whether the employee has ever been "benched" ("Yes") or not ("No").
- ExperienceInCurrentDomain: Numerical variable representing the employee's years of experience in their current domain.
- LeaveOrNot: Binary response variable representing whether the employee left the company (1) or stayed (0).
This dataset provides a framework for identifying the factors that influence employee retention. By analyzing variables such as compensation (PaymentTier), benching status (EverBenched), and professional experience (ExperienceInCurrentDomain), among others, we aim to uncover actionable insights into turnover dynamics.
Our research employs logistic regression alongside ridge and lasso regression to predict employee turnover and assess the relative importance of key predictors. Logistic regression offers interpretability and identifies significant predictors, while ridge and lasso regression introduce regularization to address multicollinearity and improve model performance. This approach allows us to address the following questions:
- How can logistic regression, logistic regression with ridge regularization, and logistic regression with lasso regularization be used to predict employee turnover?
- Moreover, how do these methods compare in their ability to identify influential factors, provide model interpretability, and achieve predictive performance?
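The three modelling approaches named above can be sketched as follows. This is a minimal, illustrative outline only (the tuned models appear later in the analysis); it assumes a numeric design matrix built with model.matrix for glmnet, with lambda chosen by cross-validation.

```r
# Standard logistic regression on all predictors in the training data
log_fit <- glm(LeaveOrNot ~ ., data = employee_train, family = binomial)

# glmnet requires a numeric design matrix (drop the intercept column)
# and a response vector
x_train <- model.matrix(LeaveOrNot ~ ., data = employee_train)[, -1]
y_train <- employee_train$LeaveOrNot

# Ridge (alpha = 0) and lasso (alpha = 1) logistic regressions,
# with lambda selected by 10-fold cross-validation
ridge_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)
lasso_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
```

Ridge shrinks all coefficients toward zero, while lasso can set some exactly to zero, which is why the two regularized fits differ in how they identify influential predictors.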
By addressing these questions, our study contributes to the ongoing discourse on employee retention, providing practical strategies for organizations to build more stable and engaged workforces. The findings aim to guide HR professionals in designing data-driven interventions to improve employee satisfaction and reduce turnover.
Methods and Results¶
Exploratory Data Analysis (EDA)¶
The dataset will be split into training and testing subsets to ensure proper model evaluation and reduce overfitting. The training data (employee_train.csv) is used for exploratory data analysis, while the test data (employee_test.csv) will later be used to evaluate the model's performance. The LeaveOrNot variable is converted into a factor to facilitate analysis across different levels. Numerical and categorical variables are separated to ensure proper visualizations and statistical summaries.
# Main developer: Ethan Rajkumar
# Contributors: Vivaan Jhaveri, Michael Wang, Ruhani Kaur
employee_train <- read.csv("data/employee_train.csv")
employee_test <- read.csv("data/employee_test.csv")
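As a quick sanity check after reading the files, the dimensions of each split and the class balance of the response can be tabulated; this is a brief sketch using the column names from the dataset description above.

```r
# Confirm both splits were read in correctly
dim(employee_train)
dim(employee_test)

# Class balance of the response in the training split
employee_train %>%
    count(LeaveOrNot) %>%
    mutate(proportion = n / sum(n))
```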
We first visualize employee_train, examining the categorical variables and the distributions (including medians) of the numerical variables.
# Main developer: Ethan Rajkumar
# Contributors: Vivaan Jhaveri, Michael Wang, Ruhani Kaur
options(warn = -1)
# Ensure 'LeaveOrNot' is a factor for proper grouping
employee_train <- employee_train %>%
mutate(LeaveOrNot = as.factor(LeaveOrNot))
# Separate numerical and categorical predictors for targeted visualizations,
# keeping LeaveOrNot alongside each group
numeric_vars <- employee_train %>% select(where(is.numeric), LeaveOrNot)
numeric_vars_names <- names(numeric_vars)
categorical_vars <- employee_train %>% select(-all_of(numeric_vars_names))
categorical_vars$LeaveOrNot <- employee_train$LeaveOrNot
Cardinality Plots¶
The following cardinality plots are generated for categorical variables and display the proportion of employees who stayed versus left.
# Main developer: Ethan Rajkumar
cardinality_plots <- lapply(names(categorical_vars)[-ncol(categorical_vars)], function(var) {
ggplot(categorical_vars, aes(x = .data[[var]], fill = LeaveOrNot)) +
geom_bar(position = "fill") +
labs(x = var, y = "Proportion") +
theme_minimal()
})
grid.arrange(
grobs = cardinality_plots,
ncol = 2, # Adjust as needed for layout
top = textGrob("Proportion of Categorical Variables by LeaveOrNot",
gp = gpar(fontsize = 15, fontface = "bold")) # Customize title size and style here
)
Box Plots for Numerical Variables¶
The following boxplots are generated to explore the distribution and central tendency of numerical variables.
# Main developer: Ethan Rajkumar
box_plots <- lapply(names(numeric_vars)[-ncol(numeric_vars)], function(var) {
    # LeaveOrNot is already a factor, so no further conversion is needed
    ggplot(numeric_vars, aes(x = LeaveOrNot, y = .data[[var]], fill = LeaveOrNot)) +
        geom_boxplot() +
        labs(x = "LeaveOrNot", y = var) +
        theme_minimal()
})
grid.arrange(
grobs = box_plots,
ncol = 2, # Adjust layout as needed
top = textGrob("Box Plots of Numeric Variables by LeaveOrNot",
gp = gpar(fontsize = 15, fontface = "bold"))
)
Pairwise Plots¶
The following pairwise plots are generated to visualize the relationships and potential multicollinearity between numerical predictors.
# Main developer: Ethan Rajkumar
options(repr.plot.width = 12, repr.plot.height = 9)
# Create ggpairs plot for all numeric variables
suppressMessages(ggpairs(numeric_vars,
aes(color = LeaveOrNot, fill = LeaveOrNot),
title = "Pairwise Relationships of Numeric Variables by LeaveOrNot",
upper = list(continuous = wrap("cor", size = 4)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 1)),
diag = list(continuous = wrap("densityDiag"))))
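Beyond the visual inspection, multicollinearity among the predictors can be quantified with variance inflation factors. The sketch below assumes `car::vif` is available (car is a dependency of the AER package loaded above); the preliminary model fit here is purely illustrative.

```r
# Preliminary logistic fit used only to compute (generalized) VIFs;
# values well below 5 suggest multicollinearity is not a serious concern
prelim_fit <- glm(LeaveOrNot ~ ., data = employee_train, family = binomial)
car::vif(prelim_fit)
```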